Unsupervised learning of natural languages.

Authors

  • Zach Solan
  • David Horn
  • Eytan Ruppin
  • Shimon Edelman
Abstract

We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

Similar resources

Efficient, Correct, Unsupervised Learning for Context-Sensitive Languages

A central problem for NLP is grammar induction: the development of unsupervised learning algorithms for syntax. In this paper we present a lattice-theoretic representation for natural language syntax, called Distributional Lattice Grammars. These representations are objective or empiricist, based on a generalisation of distributional learning, and are capable of representing all regular languag...

An algorithm for the unsupervised learning of morphology

This paper describes in detail an algorithm for the unsupervised learning of natural language morphology, with emphasis on challenges that are encountered in languages typologically similar to European languages. It utilizes the Minimum Description Length analysis described in Goldsmith 2001 and has been implemented in software that is available for downloading and testing. 1. Scope of this pap...

Two Approaches for Building an Unsupervised Dependency Parser and Their Other Applications

Much work has been done on building a parser for natural languages, but most of this work has concentrated on supervised parsing. Unsupervised parsing is a less explored area, and unsupervised dependency parser has hardly been tried. In this paper we present two approaches for building an unsupervised dependency parser. One approach is based on learning dependency relations and the other on lea...

Unsupervised Language Acquisition: Theory and Practice

In this thesis I present various algorithms for the unsupervised machine learning of aspects of natural languages using a variety of statistical models. The scientific object of the work is to examine the validity of the so-called Argument from the Poverty of the Stimulus advanced in favour of the proposition that humans have language-specific innate knowledge. I start by examining an a priori ...

Modeling Acquisition of Word Structure with Lexicalized Grammar Learning

This paper introduces a framework for learning structure in natural languages, and reports results from a simple application of it to learning word-syntax of an agglutinative language in an unsupervised manner. Arguably, the learning environment of children acquiring languages provides more information—by means of linguistic interaction and extralinguistic information present in the learning se...

Journal title:
  • Proceedings of the National Academy of Sciences of the United States of America

Volume 102, Issue 33

Pages: -

Publication date: 2005